Welcome to tidyverse

(A few remarks and tips before the practical session)


tidyverse.org



Nine “core” R packages and a “philosophy of data science design” that has inspired many more specialized packages.

Paper: Wickham et al. (2019), “Welcome to the Tidyverse”, Journal of Open Source Software

What is tidyverse?

The tidyverse is a language for solving data science challenges with R code. Its primary goal is to facilitate a conversation between a human and a computer about data. Less abstractly, the tidyverse is a collection of R packages that share a high-level design philosophy […] so that learning one package makes it easier to learn the next.

The tidyverse encompasses the repeated tasks at the heart of every data science project: data import, tidying, manipulation, visualisation, and programming.

This is still very abstract

In the spirit of hands-on interactivity, we will let “theory” and practice go hand in hand during the exercises.
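As a small taste of what this looks like in practice, here is a minimal sketch of a typical tidyverse pipeline. The table below is invented purely for illustration: the column names (sample_id, period, coverage) and all values are assumptions, not the real MesoNeo metadata. It assumes the dplyr and tibble packages are installed.

```r
library(dplyr)
library(tibble)

# Hypothetical sample metadata (made-up names and values, for illustration only)
samples <- tibble(
  sample_id = c("s1", "s2", "s3", "s4", "s5"),
  period    = c("Mesolithic", "Neolithic", "Neolithic", "Mesolithic", "Neolithic"),
  coverage  = c(0.8, 1.5, NA, 2.1, 0.4)
)

# Each step takes a data frame and returns a data frame,
# so the steps can be chained with the pipe operator
period_summary <- samples %>%
  filter(!is.na(coverage)) %>%            # drop samples with missing coverage
  group_by(period) %>%                    # one group per time period
  summarise(
    n_samples     = n(),                  # number of samples in each group
    mean_coverage = mean(coverage)        # average coverage in each group
  ) %>%
  arrange(desc(mean_coverage))            # highest average coverage first

print(period_summary)
```

Every verb here (filter, group_by, summarise, arrange) follows the same pattern: data frame in, data frame out. That shared design is what makes learning one tidyverse package help with the next.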

Further companion study material

https://r4ds.hadley.nz

Let’s talk about our example data

“Western Eurasia witnessed several large-scale human migrations during the Holocene. Here, to investigate the cross-continental effects of these migrations, we shotgun-sequenced 317 genomes—mainly from the Mesolithic and Neolithic periods—from across northern and western Eurasia. These were imputed alongside published data to obtain diploid genotypes from more than 1,600 ancient humans [and about 2,500 present-day humans].”

Our exercises will focus on two MesoNeo data sets:

  • Table of metadata information associated with each sample
  • Genome-wide data set of Identity-by-Descent segments

Why those two data sets?

  1. Best representatives of modern population genetic data
  2. Lots of opportunities to practice tidyverse data processing
  3. Even more opportunities to showcase ggplot2 possibilities

The main reason…

A great example of how to approach totally unfamiliar data!


True story.


Recently, I was given this exact data set. I had to find my way around it, and figure out how to build a project around it.

The exercises retrace my own data exploration journey!

Let’s get started!

  1. Go to www.bodkan.net/simgen
  2. Click on “Introduction to tidyverse” in the left panel
     • This session will focus on the metadata
     • “More tidyverse practice” will dig into the IBD data set
  3. The “Cheatsheets and handouts” section in the left panel has a single-page version of these slides and the dplyr cheatsheet
  4. Open RStudio and start working!